DHT-Based Distributed Crawler

Authors

  • Owen Cooper
  • Sailesh Krishnamurthy
Abstract

A search engine, like Google, is built from two pieces of infrastructure: a crawler that indexes the web and a searcher that uses the index to answer user queries. While Google's crawler has worked well, it suffers from limited timeliness and gives end-users no control to direct the crawl according to their interests. The interface presented by such search engines is hence very limited. Since the underlying index built from the crawl is not exposed, it is difficult to build other applications, such as focused crawlers (e.g., Bingo! [1]), that try to do more than just return a set of URLs.
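The core mechanism a DHT-based crawler implies can be illustrated with a short sketch: URLs are hashed onto a ring of cooperating peers, so each peer can decide locally which URLs it is responsible for crawling, with no central coordinator. This is a minimal sketch of consistent-hash partitioning only, not the paper's actual design; the names `DHTCrawlerRing` and `owner` are illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

class DHTCrawlerRing:
    """Toy consistent-hash ring: each peer owns a slice of the URL
    hash space and crawls the URLs that fall into its slice."""

    def __init__(self, peers):
        # Place each peer on the ring at the hash of its identifier.
        self.ring = sorted((self._hash(p), p) for p in peers)
        self._keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def owner(self, url):
        # The responsible peer is the first one clockwise from the
        # URL's position on the ring (wrapping around at the end).
        i = bisect_right(self._keys, self._hash(url)) % len(self.ring)
        return self.ring[i][1]

ring = DHTCrawlerRing(["peer-a", "peer-b", "peer-c"])
print(ring.owner("http://example.com/page"))  # stable peer assignment
```

Because assignments depend only on the hash, any peer (or end-user application) can compute which node holds the crawl data for a URL, which is what makes exposing the index to other applications feasible.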


Similar Resources

Crawling BitTorrent DHTs for Fun and Profit

This paper presents two kinds of attacks based on crawling the DHTs used for distributed BitTorrent tracking. First, we show how pirates can use crawling to rebuild BitTorrent search engines just a few hours after they are shut down (crawling for fun). Second, we show how content owners can use related techniques to monitor pirates’ behavior in preparation for legal attacks and negate any perce...
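The crawling technique this abstract refers to amounts to a breadth-first walk of the DHT overlay. Below is a minimal sketch, assuming a hypothetical `find_node(node)` RPC wrapper that returns the contacts a peer reports from its routing table; it is not the paper's measurement code.

```python
from collections import deque

def crawl_dht(bootstrap, find_node, max_nodes=10_000):
    """Breadth-first enumeration of a Kademlia-style DHT.
    `find_node(node)` is an assumed RPC wrapper returning the
    contacts that `node` reports from its routing table."""
    seen = set(bootstrap)
    queue = deque(bootstrap)
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        try:
            contacts = find_node(node)
        except (TimeoutError, OSError):
            continue  # churn: many peers are unreachable, skip them
        for contact in contacts:
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return seen
```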


Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web

This paper describes a decentralized peer-to-peer model for building a Web crawler. Most current systems use a centralized client-server model, in which the crawl is performed by one or more tightly coupled machines while the distribution of crawling jobs and the collection of crawled results are managed centrally through a shared URL repository. Centralized solutions ar...
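One common way to decentralize the URL repository, in the spirit of this design, is to hash each URL's domain to a peer so that discovered links are forwarded to their owner instead of being stored centrally. The sketch below is illustrative; the helpers `responsible_peer`, `route_link`, and `send` are hypothetical names, not Apoidea's API.

```python
import hashlib
from urllib.parse import urlparse

def responsible_peer(url, peers):
    """Map a URL's domain to one peer, so every page of a site is
    crawled by the same node and per-site politeness stays local.
    Plain modulo hashing here; a real DHT uses its own routing."""
    domain = urlparse(url).netloc
    digest = int(hashlib.sha1(domain.encode()).hexdigest(), 16)
    return peers[digest % len(peers)]

def route_link(url, self_id, peers, local_frontier, send):
    """On extracting a link, either queue it locally or forward it
    to the owning peer (`send` is an assumed messaging primitive)."""
    owner = responsible_peer(url, peers)
    if owner == self_id:
        local_frontier.append(url)
    else:
        send(owner, url)
```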


Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System

A large-scale distributed Web crawling system built from voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such functionality is incremental...
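A standard heuristic for incremental recrawl, shown here only as a generic illustration and not as the paper's scale-adaptable strategy, is to shrink a page's revisit interval when a change is observed and grow it when the page is unchanged:

```python
import heapq
import time

class RecrawlScheduler:
    """Toy incremental-recrawl queue: a page's revisit interval is
    halved when a change is observed and doubled otherwise."""

    def __init__(self, min_gap=3600.0, max_gap=30 * 86400.0):
        self.due = []        # min-heap of (next_due_time, url)
        self.interval = {}   # url -> current interval in seconds
        self.min_gap, self.max_gap = min_gap, max_gap

    def report(self, url, changed):
        gap = self.interval.get(url, self.min_gap)
        gap = gap / 2 if changed else gap * 2
        gap = max(self.min_gap, min(self.max_gap, gap))
        self.interval[url] = gap
        heapq.heappush(self.due, (time.time() + gap, url))

    def pop_ready(self):
        """Return a URL whose recrawl time has arrived, if any."""
        if self.due and self.due[0][0] <= time.time():
            return heapq.heappop(self.due)[1]
        return None
```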


Distributed High-Performance Web Crawler Based on Peer-to-Peer Network

Distributing the crawling activity among multiple machines spreads the processing load of downloading and analyzing web pages. This paper presents the design of a distributed web crawler based on a peer-to-peer network. The distributed crawler harnesses the excess bandwidth and computing resources of the nodes in the system to crawl the web. Each crawler is deployed on a computing node of the P2P network to analyze web pa...
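Harnessing only a node's excess bandwidth is commonly done with a rate limiter on each participant. A token-bucket sketch follows, with `rate` and `burst` as assumed local policy knobs rather than parameters from this paper:

```python
import time

class TokenBucket:
    """Token-bucket limiter so a volunteer node donates only its
    spare capacity; `rate` is the bytes/sec it is willing to give."""

    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def consume(self, nbytes):
        """Block until `nbytes` of download budget is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```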


IglooG: A Distributed Web Crawler Based on Grid Service

A web crawler is a program used to download documents from web sites. This paper presents the design of a distributed web crawler on a grid platform. This distributed web crawler is based on our previous work, Igloo. Each crawler is deployed as a grid service to improve the scalability of the system. Information services in our system are in charge of distributing URLs to balance the loads of the cra...
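The load-balancing role of such an information service can be illustrated with a least-loaded dispatch rule; the class below is a toy stand-in under that assumption, not IglooG's actual grid-service interface:

```python
import heapq

class URLDispatcher:
    """Toy information service: hand each batch of URLs to the
    currently least-loaded crawler."""

    def __init__(self, crawlers):
        self.load = [(0, c) for c in crawlers]  # (pending, crawler_id)
        heapq.heapify(self.load)

    def dispatch(self, urls):
        pending, crawler = heapq.heappop(self.load)
        heapq.heappush(self.load, (pending + len(urls), crawler))
        return crawler  # caller forwards `urls` to this crawler
```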




Publication date: 2003